Punctuation Normalisation for Cleaner Treebanks and Parsers

Authors

  • Daniel Tse
  • James R. Curran
Abstract

Although punctuation is pervasive in written text, its treatment in parsers and corpora is often second-class. We examine the treatment of commas in CCGbank, a wide-coverage corpus for Combinatory Categorial Grammar (CCG), reanalysing its comma structures in order to eliminate a class of redundant rules and obtain a more consistent treebank. We then eliminate these rules from C&C, a wide-coverage statistical CCG parser, obtaining a 37% increase in parsing speed on the standard CCGbank test set and a considerable reduction in memory consumed, without affecting parser accuracy.

Similar resources

Training Parsers on Incompatible Treebanks

We consider the problem of training a statistical parser in the situation when there are multiple treebanks available, and these treebanks are annotated according to different linguistic conventions. To address this problem, we present two simple adaptation methods: the first method is based on the idea of using a shared feature representation when parsing multiple treebanks, and the second met...

Dependency parsing representation effects on the accuracy of semantic applications ― an example of an inflective language

In this paper we investigate how different dependency representations of a treebank influence the accuracy of the dependency parser trained on this treebank and the impact on several parser applications: named entity recognition, coreference resolution and limited semantic role labeling. For these experiments we use Latvian Treebank, whose native annotation format is dependency based hybrid aug...

Preparing, Restructuring, and Augmenting a French Treebank: Lexicalised Parsers or Coherent Treebanks?

We present the Modified French Treebank (MFT), a completely revamped French Treebank, derived from the Paris 7 Treebank (P7T), which is cleaner, more coherent, has several transformed structures, and introduces new linguistic analyses. To determine the effect of these changes, we investigate how the MFT fares in statistical parsing. Probabilistic parsers trained on the MFT training set (currentl...

One model, two languages: training bilingual parsers with harmonized treebanks

We introduce an approach to train lexicalized parsers using bilingual corpora obtained by merging harmonized treebanks of different languages, producing parsers that can analyze sentences in either of the learned languages, or even sentences that mix both. We test the approach on the Universal Dependency Treebanks, training with MaltParser and MaltOptimizer. The results show that these bilingua...

LTAG-spinal treebank and parser for Hindi

Statistical parsers need huge annotated treebanks to learn from, and building treebanks is an expensive proposition. To create parsers for different grammar formalisms in a language, building separate treebanks for each of those isn’t a feasible task. Treebanks available in one formalism can be converted into another either automatically or with minimal human effort by exploiting the similariti...

Publication date: 2008